A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. However, SVMs are more commonly used in classification problems and as such, this is what we will focus on in this session.
In the following example, we try to separate the data with one of the three lines $H_1$, $H_2$ or $H_3$, and then assign classes based on whether an observation lies above or below the line. Which line do you think is doing the best job?
As we can see, $H_1$ does not separate the two classes at all. $H_2$ separates them, but only by a small distance. $H_3$ separates them with the maximum distance. In other words, $H_3$ seems to be the most confident in separating the two classes.
SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.
What is a hyperplane?
As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.
Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.
So when new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.
Support Vectors: Support vectors are the data points (vector points) nearest to the hyperplane. They are called support vectors because only these points "support", or determine, the decision boundary. Since only these points contribute to the result of the algorithm, and the other points do not, they can be considered the critical elements of the data set.
Margin : The margin is defined as the distance between the separating hyperplane (decision boundary) and the training samples (support vectors) that are closest to this hyperplane.
How do we find the right hyperplane?
Asking this question is similar to asking: how do we best segregate the two classes within the data?
The perpendicular distance between the hyperplane and the nearest data point (support vectors) from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
For example, let's look at the figure above again.
Note that finding the largest possible margin allows more accurate classification of new points, making the model a lot more robust. You can see that the new grey point would be correctly assigned to the blue class when using the $H_3$ hyperplane.
Technically, this hyperplane can also be called the maximum-margin hyperplane.
Mathematically,
A hyperplane is a $(p - 1)$-dimensional subspace of a $p$-dimensional space. The distance between either side of the dashed line and the solid line is the margin.
For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace—in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace—that is, a plane. In $p > 3$ dimensions, it can be hard to visualize a hyperplane, but the notion of a $(p - 1)$-dimensional flat subspace still applies.
Thus the dimension of the hyperplane depends on the number of features. For a 2-dimensional space, the hyperplane is 1-dimensional: just a line. For a 3-dimensional space, the hyperplane is 2-dimensional: a plane that slices the cube. It becomes difficult to imagine when the number of features exceeds 3.
Any hyperplane can be written mathematically as:
$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = 0$$

For a 2-dimensional space, the hyperplane is a line:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$$

The data points above this line are those whose $x_1$, $x_2$ satisfy:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0$$

By similar logic, data points below this line satisfy:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0$$

Assuming the label $y$ is either 1 (for green) or -1 (for red), all three lines below are separating hyperplanes, because they all share the same property: above the line is green; below the line is red.

This property can be written mathematically as follows:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0 \quad \text{if} \quad y=1 \quad (\text{Green})$$ $$\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0 \quad \text{if} \quad y=-1 \quad (\text{Red})$$

If we further generalize these two conditions into one, it becomes:

$$y\,(\beta_0 + \beta_1 x_1 + \beta_2 x_2)>0$$

Sometimes it may not be possible to separate the two classes perfectly. In such scenarios, a soft margin is used, where some points are allowed to be misclassified or to fall inside the margin.
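As a small numerical sketch, the sign of $\beta_0 + \beta_1 x_1 + \beta_2 x_2$ tells us which side of the hyperplane a point lies on (the coefficients and points below are made up purely for illustration):

```python
import numpy as np

# Hypothetical separating hyperplane beta0 + beta1*x1 + beta2*x2 = 0
# (coefficients chosen only for illustration)
beta0, beta = -1.0, np.array([1.0, 1.0])

def side(x):
    """Return +1 if the point lies above the hyperplane, -1 if below."""
    return 1 if beta0 + beta @ x > 0 else -1

# A point well above the line and one well below it
print(side(np.array([2.0, 2.0])))   # +1 (green side)
print(side(np.array([0.0, 0.0])))   # -1 (red side)
```

A correctly classified point $(x, y)$ then satisfies $y\,(\beta_0 + \beta_1 x_1 + \beta_2 x_2) > 0$, since the label and the sign of the expression agree.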
Using this example, we can see that the “H4” hyperplane treats the green point inside the margin as an outlier. Hence, the support vectors are the two green points closer to the main group of green points. This allows a larger margin to exist, increasing the model’s robustness.
With a soft margin, the SVM tolerates a few points being misclassified and tries to balance the trade-off between finding a line that maximizes the margin and minimizing the misclassification.
How much tolerance (softness) we want to allow when finding the decision boundary is an important hyperparameter for the SVM. In sklearn, it is represented by the penalty term C.
The C parameter lets you decide how much you want to penalize misclassified points. The bigger the C, the more penalty the SVM incurs when it makes a misclassification.
CAUTION: Beware that while setting a high value for C is likely to lead to better model performance on the training data, there is a high risk of overfitting the model, producing poor results on the test data.
So, in such a case, we try to find a separating line but tolerate one or a few misclassified dots (e.g. the dots circled in red).
You can see the difference between the two models. On the right, the model allows a point to be misclassified, while on the left there is no misclassification.
So on the right, with a low C parameter, we are prioritizing simplicity; this line is also called a line with a soft margin, because we are allowing that misclassification. On the left, we are prioritizing making no mistakes, but in this case we are probably overfitting.
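A minimal sketch with sklearn's SVC comparing a low-C (soft margin) and a high-C model; the overlapping blobs below are synthetic data made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping blobs (synthetic data for illustration)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Low C: wide, "soft" margin that tolerates misclassified points.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
# High C: narrow margin that tries hard to classify every point correctly.
hard = SVC(kernel="linear", C=100).fit(X, y)

# The softer model typically keeps more support vectors, because more
# points fall inside its wider margin.
print(len(soft.support_vectors_), len(hard.support_vectors_))
```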
Low C: An SVM model with a low value of C cares almost exclusively about maximizing the margin and cares less about the close points.
Medium C: By setting a medium C, we are willing to give up correctly classifying the one orange point in order to have a larger separation between the points that we did succeed in classifying; according to a medium C, this point would actually be predicted as blue.
High C: Using a high C to train our support vector machine, we put the decision boundary between an orange point and its immediate neighbor, the blue point, and that perfectly classifies all of our examples. In other words, this gives us a training accuracy of 100%: essentially everything we used to train the model is predicted correctly. But apart from training accuracy, we would also like to see how the model does on data it has not been exposed to during training, i.e., the test data.
In simple words, the parameter C allows us to control how much we want to trade off a large separation between classified examples against a perfect classification. So if you have a really high C, you are saying you care a lot about getting every single point precisely correct.
But how do you pick the best C parameter? You need to try different values and see which C value works best for your model.
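One common way to try different values is a cross-validated grid search. A minimal sketch with sklearn; the grid values and the synthetic dataset are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data, just for demonstration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Cross-validated search over a log-spaced grid of C values
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

# Best C found and its mean cross-validation accuracy
print(grid.best_params_["C"], round(grid.best_score_, 3))
```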
As suggested in sklearn's SVC documentation:
Setting C: C is 1 by default and it’s a reasonable default choice.
If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
Once the classifier is trained, it becomes easy to classify a new data instance. Also, because the SVM needs only the support vectors to classify new data instances, it is quite efficient. In other words, it uses a subset of the training points in the decision function (the support vectors), so it is also memory efficient.
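For instance, sklearn's fitted SVC exposes exactly this subset through its `support_vectors_` attribute (the blob data below is synthetic, for illustration only):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Well-separated toy data: 100 points in two blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors enter the decision function, so they are
# usually a small subset of the 100 training points.
print(clf.support_vectors_.shape)
```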
In the linearly separable case, the SVM tries to find the hyperplane that maximizes the margin, with the condition that both classes are classified correctly. But in reality, datasets are almost never linearly separable, so the condition of 100% correct classification by a hyperplane will rarely be met.
SVMs address non-linearly separable cases by introducing two concepts: the Soft Margin and the Kernel Trick.
In short,
Another reason why SVMs enjoy high popularity among machine learning practitioners is that they can be easily kernelized to solve nonlinear classification problems.
Obviously, we would not be able to separate the positive and negative classes well using a linear hyperplane as the decision boundary, whether via the linear logistic regression or the linear SVM model discussed in earlier sections. The solution lies in projecting the samples into a higher-dimensional space.
To understand the concept, let’s take a look at a simple example of data in 1 dimension which is not easy to separate and how adding another dimension makes it easy.
In this example, the picture on the left shows our original data points. In 1-dimension, this data is not linearly separable, but after applying the transformation $\phi(x) = x^2$ and adding this second dimension to our feature space, the classes become linearly separable.
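This transformation can be sketched in a few lines of NumPy; the sample points below are made up for illustration:

```python
import numpy as np

# 1-D points: the "inner" class sits between the two halves of the
# "outer" class, so no single threshold on x separates them...
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])   # -1: inner class, +1: outer class

# ...but after adding the second feature phi(x) = x**2, a simple
# threshold on the new axis separates the classes perfectly.
z = x ** 2
print(all((z > 2.5) == (y == 1)))   # True: the line z = 2.5 separates them
```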
Explaining this is easy with another simplified example.
As you can see, red and blue points are not linearly separable since we cannot draw a line that would put these two classes on different sides of such a line. However, we can separate them by drawing a circle with all the blue points inside it and the red points outside it.
How to transform this problem?
In order to classify a dataset like the one above it’s necessary to move away from a 2d view of the data to a 3d view.
Let’s add a third dimension and make it a sum of squared $X_1$ and $X_2$ values i.e., $Z = X_1^2 + X_2^2$ .
Using this three-dimensional space with $X_1$, $X_2$ and $Z$ coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and blue points.
The basic idea behind kernel methods to deal with such linearly inseparable data is to create nonlinear combinations of the original features to project them onto a higher dimensional space via a mapping function $\phi(.)$ where it becomes linearly separable. As shown in the next figure, we can transform a two-dimensional dataset onto a new three-dimensional feature space where the classes become separable via the following projection:
$$\phi(x_1,x_2)=(z_1,z_2,z_3)=(x_1,x_2,{x_1}^2+{x_2}^2)$$

This allows us to separate the two classes shown in the plot via a linear hyperplane, which becomes a nonlinear decision boundary if we project it back onto the original feature space:
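The projection $Z = X_1^2 + X_2^2$ can be sketched directly; the circle radii below are made up for illustration:

```python
import numpy as np

# Blue points on a small circle, red points on a larger one:
# not linearly separable in 2-D.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]        # radius 1
outer = 3 * np.c_[np.cos(theta), np.sin(theta)]    # radius 3

def phi(p):
    """Project (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.c_[p, (p ** 2).sum(axis=1)]

# In the new space the third coordinate alone separates the classes:
# z = 1 for every inner point, z = 9 for every outer point, so the
# flat plane z = 5 is a separating hyperplane.
print(phi(inner)[:, 2].max(), phi(outer)[:, 2].min())
```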
Kernel methods represent the techniques that are used to deal with linearly inseparable data or non-linear data set shown in above examples. The idea is to create nonlinear combinations of the original features to project them onto a higher-dimensional space via a mapping function, $\phi$ , where the data becomes linearly separable.
By using the above method, the training data can be transformed into a higher-dimensional feature space via a mapping function $\phi$ and a linear SVM model can be trained to classify the data in this new feature space. However, this method is computationally very expensive.
The kernel function is a function that can be expressed as the dot product of the mapped feature vectors, and can be written as follows:

$$k(x_i,x_j) = \phi(x_i)^\top \phi(x_j)$$

The scheme described so far is attractive due to its simplicity. However, considering the computational consequences of increasing the dimensionality, such dataset transformations incur serious computational and memory costs.
Here is a concrete example: the Polynomial Kernel is a kernel often used with SVMs. For a dataset in $\mathbb{R}^2$, a degree-two polynomial kernel (implicitly) performs the transformation $[x_1, x_2] \rightarrow [{x_1}^2, {x_2}^2, \sqrt{2}\, x_1 x_2, \sqrt{2 c}\, x_1, \sqrt{2 c}\, x_2, c]$. This transformation adds four additional dimensions, $\mathbb{R}^2 \rightarrow \mathbb{R}^6$. In real applications, there might be many features in the data, and applying transformations that involve many polynomial combinations of these features will lead to extremely high and impractical computational costs. In general, the computational cost increases with the dimension of the data.
The kernel trick owes its name to the use of kernel functions, which enable SVMs to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space. Any linear model can be turned into a non-linear model by applying the kernel trick: replacing its features (predictors) with a kernel function.
So, the Kernel Trick is a simple and computationally cheaper method in which non-linear data is implicitly projected onto a higher-dimensional space, where it can be linearly divided by a plane, without explicitly computing the coordinates in that space.
Our kernel function accepts inputs in the original lower dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. As long as we can compute the inner product in the feature space, we do not require the mapping explicitly. This operation is often computationally cheaper than the explicit computation of the coordinates and thus called Kernel Trick.
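To make this concrete, here is a small sketch using the degree-two polynomial map from the example above (with $c = 1$), verifying that the kernel computed in the original 2-D space matches the explicit dot product in $\mathbb{R}^6$:

```python
import numpy as np

def phi(v, c=1.0):
    """Explicit degree-2 polynomial feature map R^2 -> R^6."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def poly_kernel(u, v, c=1.0):
    """Same inner product, computed directly in the original 2-D space."""
    return (np.dot(u, v) + c) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Both routes give the same number, but the kernel never maps to R^6.
print(np.isclose(phi(u) @ phi(v), poly_kernel(u, v)))   # True
```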
In layman's terms, a kernel function is a shortcut that helps us do certain calculations faster which would otherwise involve computations in a higher-dimensional space.
The kernel function reduces the complexity of finding the mapping function. So, the kernel function defines the inner product in the transformed space. There are out-of-the-box kernel functions, such as the following, which can be applied when training models using the SVM algorithm:
In vector form, they are represented as:
Think of the polynomial kernel as a transformer to generate new features by applying the polynomial combination of all the existing features.
For degree-$d$ polynomials, the polynomial kernel is defined as:

$$K(x, y) = (x^\top y + c)^d$$

where $x$ and $y$ are vectors in the input space, i.e. vectors of features computed from training or test samples, and $c \geq 0$ is a free parameter.
One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel, which is also the default kernel used within sklearn's SVM classification algorithm. This function is as follows:

$$k(x_i, x_j) = \exp\left(-\gamma\, \lVert x_i - x_j \rVert^2\right)$$

The minus sign inverts the distance measure into a similarity score and, due to the exponential term, the resulting similarity score falls into a range between 1 (for identical samples) and 0 (for very dissimilar samples).
Here, $\gamma = {{1} \over {2 \sigma^2}}$ is a free parameter that is to be optimized. The optimization of $\gamma$ also plays an important role in controlling overfitting.
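A minimal sketch of the RBF similarity score and the effect of $\gamma$; the sample points are chosen arbitrarily for illustration:

```python
import numpy as np

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel: a similarity score in the range (0, 1]."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([0.0, 0.0])
near, far = np.array([0.1, 0.1]), np.array([5.0, 5.0])

# Identical points score exactly 1; distant points approach 0.
print(rbf(x, x, gamma=0.5))                         # 1.0
print(rbf(x, near, 0.5) > rbf(x, far, 0.5))         # True

# A larger gamma shrinks the "reach" of each training point, tightening
# the decision boundary around the samples (and risking overfitting).
print(rbf(x, near, gamma=10) < rbf(x, near, gamma=0.5))   # True
```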
In the figure given, we can see an example of the decision boundary around classes 0 and 1 using an RBF kernel with a relatively large value of $\gamma$.